VOICE TAGGING, VOICE ANNOTATION, AND SPEECH RECOGNITION FOR 
PORTABLE DEVICES WITH OPTIONAL POST PROCESSING 

FIELD OF THE INVENTION 
[0001] The present invention generally relates to tagging of captured 
media for ease of retrieval, indexing, and mining, and particularly relates to a 
tagging and annotation paradigm for use on-board and subsequently with respect 
to a portable media capture device. 

BACKGROUND OF THE INVENTION 

[0002] Today's tasks relating to production of media, and especially 
production of multimedia streams, benefit from text labeling of media and 
especially media clips. This text labeling facilitates the organization and retrieval 
of media and media clips for playback and/or editing procedures relating to 
production of media. This facilitation is especially prevalent in production of 
composite media streams, such as a news broadcast composed of multiple 
media clips, still frame images, and other media recordings. 

[0003] In the past, such tags have been inserted by a technician 
examining captured media in a booth at a considerable time after capture of the 
media with a portable media capture device, such as a video camera. This 
intermediate step between capture of media and production of a composite 
multimedia stream is both expensive and time consuming. Therefore, it would be 
advantageous to eliminate this step using speech recognition to insert tags by 
voice of a user of a media capture device immediately before, during, and/or 
immediately after a media capture activity. 

[0004] The solution of using speech recognition to insert tags by voice 
of a user of a media capture device immediately before, during, and/or 
immediately after a media capture activity has been addressed in part with 
respect to still cameras that employ speech recognition to tag still images. 
However, the limited speech recognition capabilities typically available to 
portable media devices prove problematic, such that high-quality, meaningful 
tags may not be reliably generated. Also, a solution for tagging relevant portions 
of multi-media streams has not been adequately addressed. As a result, the 
need remains for a solution to the problem of high-quality, meaningful tagging of 
captured media on-board a media capture device with limited speech recognition 
capability that is suitable for use with multi-media streams. The present invention 
provides such a solution. 



SUMMARY OF THE INVENTION 
[0005] In accordance with the present invention, a media capture 
device has an audio input receptive of user speech relating to a media capture 
activity in close temporal relation to the media capture activity. A plurality of 
focused speech recognition lexica respectively relating to media capture activities 
are stored on the device, and a speech recognizer recognizes the user speech 
based on a selected one of the focused speech recognition lexica. A media 
tagger tags captured media with text generated by the speech recognizer, and 
tagging occurs based on close temporal relation between receipt of recognized 
user speech and capture of the captured media. A media annotator annotates 
the captured media with a sample of the user speech that is suitable for input to a 
speech recognizer, and annotating is based on close temporal relation between 
receipt of the user speech and capture of the captured media. 

[0006] Further areas of applicability of the present invention will 
become apparent from the detailed description provided hereinafter. It should be 
understood that the detailed description and specific examples, while indicating 
the preferred embodiment of the invention, are intended for purposes of 
illustration only and are not intended to limit the scope of the invention. 

BRIEF DESCRIPTION OF THE DRAWINGS 

[0007] The present invention will become more fully understood from 
the detailed description and the accompanying drawings, wherein: 

[0008] Figure 1 is an entity relationship diagram depicting a media 
tagging system according to the present invention; 

[0009] Figure 2 is a block diagram depicting a media capture device 
according to the present invention; 

[0010] Figure 3 is a block diagram depicting focused lexica according 
to the present invention; 

[0011] Figure 4 is a block diagram depicting tagged and annotated 
media according to the present invention; and 

[0012] Figure 5 is a flow diagram depicting a media tagging method for 
use with a media capture device according to the present invention. 

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS 
[0013] The following description of the preferred embodiment(s) is 
merely exemplary in nature and is in no way intended to limit the invention, its 
application, or uses. 

[0014] The system and method of the present invention obtain the 
advantage of eliminating the costly and time consuming step of insertion of tags 
by a technician following capture of the media. To accomplish this advantage, 
the present invention focuses on enabling insertion of tags by voice of a user of a 
media capture device immediately before, during, and/or immediately after a 
media capture activity. An optional, automated post-processing procedure 
improves recognition of recorded user speech designated for tag generation. 
Focused lexica relating to device-specific media capture activities improve quality 
and relevance of tags generated on the portable device, and pre-defined focused 
lexica may be provided online to the device, perhaps as a service of a provider of 
the device. 

[0015] Out-of-vocabulary words still result in annotations suitable for 
input to a speech recognizer. As a result a user who recorded the media can use 
the annotations to retrieve the media content using sound similarity metrics to 
align the annotations with spoken queries. As another result, the user can 
employ the annotations with spelled word input and letter-to-sound rules to edit 
the lexicon on-board the media capture device and simultaneously generate 
textual tags. As a further result, the annotations can be used by a post- 
processor having greater speech recognition capability than the portable device 
to automatically generate text tags for the captured media. This post-processor 
can further convert textual tags associated with captured media to alternative 
textual tags based on predetermined criteria relating to a media capture activity. 
Automated organization of the captured media can further be achieved by 
clustering and indexing the media in accordance with the tags based on semantic 
knowledge. As a result, the costly and time consuming step of post-capture tag 
insertion by a technician can be eliminated successfully. It is envisioned that 
captured media may be organized or indexed by clustering textual tags based on 
semantic similarity measures. It is also envisioned that captured media may be 
organized or indexed by clustering annotations based on acoustic similarity 
measures. It is further envisioned that clustering can be accomplished in either 
manner onboard the device or on a post-processor. 

[0016] The entity relationship diagram of Figure 1 illustrates an 
embodiment of the present invention that includes a lexica source 10 distributing 
predefined, focused lexica 12 to a media capture device 14 over a 
communications network 16, such as the Internet. A manufacturer, distributor, 
and/or retailer of one or more types of media capture devices 14 may select to 
provide source 10 as a service to purchasers of device 14, and source 10 may 
distribute lexica 12 to different types of devices based on a type of device 14 and 
related types of media capture activities performed with such a device. For 
example, lexica relating to recording music may be provided to a digital sound 
recorder and player, but not to a still camera. Similarly, lexica relating to 
recording specific forms of wildlife, as with bird watching activities, may be 
provided to a still camera, but not to a digital recorder and player. Further, both 
types of lexica may be provided to a video camera. As will be readily 
appreciated, the types of media capture activities that may be performed with 
device 14 are limited by the capabilities of device 14, such that lexica 12 may be 
organized by device type according to capabilities of corresponding devices 14. 
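
By way of non-limiting illustration, the organization of lexica by device type may be sketched as follows. The capability names, the two tables, and the function `lexica_for_device` are hypothetical and are introduced here only for illustration; they do not appear in the disclosure.

```python
# Hypothetical capability sets per device type (illustrative assumption).
DEVICE_CAPABILITIES = {
    "sound_recorder": {"audio"},
    "still_camera": {"image"},
    "video_camera": {"audio", "image"},
}

# Hypothetical capability required by each focused lexicon (illustrative assumption).
LEXICA_REQUIREMENTS = {
    "music_recording": {"audio"},
    "bird_watching": {"image"},
}


def lexica_for_device(device_type):
    """Return names of the lexica whose required capabilities the device provides."""
    capabilities = DEVICE_CAPABILITIES[device_type]
    return sorted(
        name
        for name, needed in LEXICA_REQUIREMENTS.items()
        if needed <= capabilities  # subset test: the device has every needed capability
    )
```

Under these assumed tables, a video camera would be offered both lexica, while a still camera would be offered only the bird watching lexicon.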

[0017] Device 14 may obtain lexica 12 through post-processor 18, 
which is connected to communications network 16. It is envisioned, however, 
that device 14 may alternatively or additionally be connected, perhaps wirelessly, 
to communications network 16 and obtain lexica 12 directly from source 10. It is 
further envisioned that device 14 may access post-processor 18 over 
communications network 16, and that post-processor 18 may further be provided 
as a service to purchasers of device 14 by a manufacturer, distributor, and/or 
retailer of device 14. Accordingly, source 10 and post-processor 18 may be 
identical. 

[0018] Figure 2 illustrates an embodiment of device 14 corresponding 
to a video camera. Accordingly, predefined and/or edited focused lexica arrive at 
external data interface 20 of device 14 as external data input/output 22. Lexicon 
editor 24 stores the lexica in lexica datastore 26. The lexica preferably provide a 
user navigable directory structure for storing captured media, with each focused 
lexicon associated with a destination folder of a directory tree structure illustrated 
in Figure 3. For instance, a user folder 28 for storing media of a particular user 
contains various subfolders 30A and 30B relating to particular media capture 
activities. Each folder is preferably voice tagged to allow a user to navigate the 
folders by voice employing a system heuristic that relates matched speech 
models to folders and subfolders entitled with a descriptive text tag 
corresponding to the speech models. 

Attorney Docket No. 9432-000247 

[0019] Threads relate matched speech models to groups of speech 
models. For example, a user designated as "User A" may speak the phrase 
"User A" into an audio input of the device to specify themselves as the current 
user. In response, the device next employs folder lexicon 32 for "User A" based 
on the match to the voice tag 34 for user folder 28. Thus, when the user next 
speaks "Business" and the device matches the speech input to voice tag 36 for 
sub-folder 30B, two things occur. First, sub-folder 30B is selected as the folder 
for storing captured media. Second, focused lexicon 38 is selected as the current 
speech recognition lexicon. A user lexicon containing voice tag 34 and other 
voice tags for other users is also active so that a new user may switch users at 
any time. Thus, a switch in users results in a shift of the current lexicon to a 
lexicon for the subfolders of the new user. 
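
The navigation behavior described above may be sketched as follows, under stated assumptions. Plain text matching stands in for matching of speech models, and the names `Folder`, `Navigator`, and `hear` are hypothetical. The sketch keeps the user lexicon active at all times and, once a user is selected, keeps that user's subfolder names active alongside the focused lexicon of the selected subfolder.

```python
from dataclasses import dataclass, field


@dataclass
class Folder:
    name: str                                        # descriptive text tag of the folder
    lexicon: set = field(default_factory=set)        # focused lexicon for this folder
    subfolders: dict = field(default_factory=dict)   # subfolder name -> Folder


class Navigator:
    def __init__(self, user_folders):
        self.user_folders = user_folders   # user name -> Folder; always active
        self.current_user = None
        self.current_folder = None

    def hear(self, phrase):
        """Update the selection for a recognized phrase; return the active vocabulary."""
        if phrase in self.user_folders:
            # Switching users resets the subfolder selection.
            self.current_user = self.user_folders[phrase]
            self.current_folder = None
        elif self.current_user and phrase in self.current_user.subfolders:
            self.current_folder = self.current_user.subfolders[phrase]
        return self.active_vocabulary()

    def active_vocabulary(self):
        vocabulary = set(self.user_folders)
        if self.current_user is not None:
            vocabulary |= set(self.current_user.subfolders)
        if self.current_folder is not None:
            vocabulary |= self.current_folder.lexicon
        return vocabulary
```

In this sketch, hearing "User A" and then "Business" would select the corresponding subfolder for storage and make its focused lexicon part of the active vocabulary, while another user's name would still be recognized.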

[0020] Returning to Figure 2, a user interface 40A is also provided to 
device 14 that includes user manipulable switches, such as buttons and knobs, 
that can alternatively or additionally be used to specify a user identity or 
otherwise navigate and select the focused lexica. An interface output 40B is also 
provided in the form of an active display and/or speakers for viewing and 
listening to recorded media and monitoring input to video and audio inputs 42A 
and 42B. In operation, a user may select a focused lexicon and press a button of 
interface 40A whenever he or she wishes to add a voice tag to recorded media. 
For example, the user may select to start recording by activating a record mode 
44, and add a voice tag relating to what the user is about to start recording. 
Audio and video input 46 and 48 are combined by a media clip generator into a 
media clip 52. Also, the portion of the audio input 46 that occurred during the 
pressing of the button is sent to speech recognizer 54 as audio clip 56. This 
action is equivalent to performing speech recognition on an audio portion of the 
media clip during pressing of the button. 

[0021] Speech recognizer 54 employs the currently selected focused 
lexicon of datastore 26 to generate recognition text 58 from the user speech 
contained in the audio clip 56. In turn, media clip tagger 60 uses text 58 to tag 
the media clip 52 based on the temporal relation between the media capture 
activity and the tagging activity. Tagger 60, for example, may tag the clip 52 as a 
whole with the text 58 based on the text 58 being generated from user speech 
that occurred immediately before or immediately after the start of filming. This 
action is equivalent to placing the text 58 in a header of the clip 52. Alternatively, 
a pointer may be created between the text and a specific location in the media 
clip in which the tag is spoken. Further, media clip annotator 62 annotates the 
tagged media clip 64 by storing audio clip 56 containing a sample of the user 
speech suitable for input to a speech recognizer in device memory, and 
instantiating a pointer from the annotation to the clip as a whole. This action is 
equivalent to creating a pointer to the header or to the text tag in the header. 
This action is also equivalent to creating a general annotation pointer to a portion 
of the audio clip that contains the speech sample. 

[0022] Results of tagging and annotation activity of a multimedia stream 
according to the present invention are illustrated in Figure 4. This example 
employs pointers between textual tags and locations in the captured media 
based on a time at which the tagging occurred during filming of a sports event 
such as a football game. Also, a user identifier 66 and time and date 68 of the 
activity are recorded in relation to the media stream 70 at the beginning of the 
stream 70 as a type of header. Further, the user may select a prepared, focused 
lexicon for recording sports events and identify the type of sports event, the 
competitors and the location at the beginning of the media stream 70. As a 
result, textual tags 72A-C are recorded in relation to the beginning of the 
media stream with information relating to the confidence levels 74A-C of the 
respective recognition attempts. A predetermined offset from the pointer 
identifies the portion of the media stream 70 in which the annotation 76 is 
contained for tags 72A-C. Subsequent tagging attempts result in similar tags, 
and failed recognition attempts 78 are also recorded so that a related 
annotation is created by virtue of the pointer and offset. It is envisioned that 
alternative tagging techniques may be employed, especially in the case of 
instantaneously captured media, captured media having no audio component, 
and/or captured media with multiple, dedicated audio inputs. For example, a still 
camera may record annotations and any successfully generated tags with 
general pointers from recording of user speech and related text to a digitally 
stored image. Also, a video cassette recorder may record a multimedia 
broadcast received on a cable, and may additionally receive user speech via a 
microphone of a remote control device and record it separately. Thus, the user 
annotation need not be integrated into the multimedia stream. 
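
One possible representation of the tagged and annotated stream of Figure 4 is sketched below. The class and field names are hypothetical, the three second offset is an arbitrary assumed value, and the sketch assumes that the annotation occupies the interval beginning at the pointer and extending by the predetermined offset.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TagRecord:
    pointer_s: float        # stream position, in seconds, at which the tag was spoken
    text: Optional[str]     # recognized text; None marks a failed recognition attempt
    confidence: float       # recognizer confidence, assumed to lie in [0, 1]


@dataclass
class MediaStream:
    user_id: str                       # header: user identifier
    recorded_at: str                   # header: time and date of the activity
    annotation_offset_s: float = 3.0   # assumed predetermined offset, in seconds
    tags: List[TagRecord] = field(default_factory=list)

    def add_attempt(self, pointer_s, text, confidence):
        """Record every attempt, including failures, so its audio stays addressable."""
        self.tags.append(TagRecord(pointer_s, text, confidence))

    def annotation_span(self, record):
        """Interval of the stream assumed to hold the spoken annotation for a record."""
        return (record.pointer_s, record.pointer_s + self.annotation_offset_s)

    def needs_review(self, threshold=0.6):
        """Failed or low confidence attempts that may be revisited later."""
        return [t for t in self.tags if t.text is None or t.confidence < threshold]
```

Because a failed attempt is stored with its pointer, the related annotation remains locatable even though no text tag was generated.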

[0023] Returning to Figure 2, tagged and/or annotated captured media 
80 stored in the directories provided by the focused lexica may be retrieved by 
the user employing clip retriever 82. Accordingly, the user enters a retrieval 
mode 84 and utters a speech query. Speech recognizer 54 is adapted to 
recognize the speech query using the corpus of the focused lexica of datastore 
26, and to match recognition text to tags of the captured media. A list of 
matching clips is thus retrieved and presented to the user for final selection via 
interface output 40B, which communicates the retrieved clip 86 to the user. Also, 
speech recognizer 54 is adapted to use sound similarity metrics to align the 
annotations with spoken queries, and this technique reliably retrieves clips for a 
user who made the annotation. Thus, speech recognizer 54 may take into account 
which user is attempting to retrieve clips when using this technique. 
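
The two retrieval paths described above may be sketched as follows. This is a simplified illustration only: the clip records and the function `retrieve` are hypothetical, and the sequence similarity ratio of Python's standard `difflib` module, applied to phoneme lists, is used merely as a stand-in for the sound similarity metrics, which the description does not specify.

```python
from difflib import SequenceMatcher


def retrieve(clips, query_text=None, query_phonemes=None, min_similarity=0.8):
    """Return ids of clips whose tag matches the query text, or, when a phoneme
    sequence is given instead, whose annotation is sufficiently similar to it."""
    matches = []
    for clip in clips:
        if query_text is not None and query_text in clip["tags"]:
            matches.append(clip["id"])
        elif query_phonemes is not None and clip.get("phonemes"):
            # Stand-in similarity measure over phoneme sequences (assumption).
            ratio = SequenceMatcher(None, query_phonemes, clip["phonemes"]).ratio()
            if ratio >= min_similarity:
                matches.append(clip["id"])
    return matches
```

The returned list corresponds to the list of matching clips presented to the user for final selection.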

[0024] Annotations related to failed recognition attempts or low 
confidence tags may be presented to the user that made those annotations for 
editing. For low confidence tags, the user may confirm or deny the tags. Also, 
the user may enter a lexicon edit mode and edit a lexicon based on an 
annotation using spelled word input to speech recognizer 54 and letter-to-sound 
rules. Speech recognizer 54 also creates a speech model from the annotation in 
question, and lexicon editor 24 constructs a tag from the text output of recognizer 
54 and adds it to the current lexicon in association with the speech model. 
Finally, captured media 80 may be transmitted to a post-processor via external 
data interface 20. 

[0025] Returning to Figure 1, post-processor 18 has speech recognizer 
89 that is enhanced compared to that of device 14. In one respect, the 
enhancement stems from the use of full speech recognition lexicon 90, which has 
a larger vocabulary than the focused lexica of device 14. Post-processor 18 thus 
receives at least annotations from device 14 in the form of external data 
input/output 22, and performs speech recognition on the received annotations to 
generate textual tags for the related, captured media. In one embodiment, post- 
processor 18 is adapted to generate tags for annotations associated with 
recognition attempts that failed and/or that produced tags of low confidence. It is 
envisioned that post-processor 18 may communicate the generated tags to 
device 14 as external data input/output 22 for association with the related, 
captured media. It is further envisioned that post-processor 18 may receive the 
related and possibly tagged, captured media as external data input/output 22, 
and store the tagged and/or annotated media in datastore 92. In such a case, 
post-processor 18 may supplement the annotated media of datastore 92 by 
adding tags to captured media based on related annotations. Additionally, post- 
processor 18 may automatically generate an index 94 for the captured media of 
datastore 92 or stored on device 14 using semantic knowledge, clustering 
techniques, and/or mapping module 96. 
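
The supplementing of tags by the post-processor may be sketched as follows. The record layout, the threshold value, and the function `post_process` are hypothetical; the enhanced recognizer is represented by a caller-supplied function `recognize` that is assumed to return a text and a confidence for a given annotation.

```python
def post_process(records, recognize, threshold=0.6):
    """Re-recognize annotations whose on-device tag is missing or of low confidence.

    `recognize` stands in for the enhanced recognizer and is assumed to return
    a (text, confidence) pair, with text set to None on failure.
    """
    for record in records:
        if record["text"] is None or record["confidence"] < threshold:
            text, confidence = recognize(record["annotation"])
            # Keep the on-device result unless the new result is more confident.
            if text is not None and confidence > record["confidence"]:
                record["text"] = text
                record["confidence"] = confidence
    return records
```

Records whose on-device tags already meet the threshold are left untouched in this sketch, so the enhanced recognizer is consulted only where the device struggled.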

[0026] Semantic knowledge may be employed in constructing index 94 
by generating synonyms for textual tags that are appropriate in a context of 
media capture activities in general, in a context of a type of media capture 
device, or in a context of a specific media capture activity. Image feature 
recognition can further be employed to generate tags, and the types of image 
features recognized and/or tags generated may be focused toward contexts 
relating to media capture activities, devices, and/or users. For example, a still 
camera image may be recognized as a portrait or landscape and tagged as such. 

[0027] Clustering techniques can further be employed to categorize 
and otherwise commonly index similar types of captured media, and this 
clustering may be focused toward contexts relating to media capture activities, 
devices, and/or users. For example, the index may have categories of "portrait", 
"landscape", and "other" for still images, while having categories of "sports", 
"drama", "comedy", and "other" for multimedia streams. Also, subcategories may 
be accommodated in the categories, such as "mountains", "beaches", and 
"cityscapes" for still images under the "landscape" category. 

[0028] Mapping module 96 is adapted to convert textual tags associated 
with captured media to alternative textual tags based on predetermined criteria 
relating to a media capture activity. For example, the names and numbers of 
players of a sports team may be recorded in a macro and used during post- 
processing to convert tags designating player numbers to tags designating player 
names. Such macros may be provided by a manufacturer, distributor, and/or 
retailer of device 14, and may also focus toward contexts relating to media 
capture activities, devices, and/or users. The index 94 developed by post-
processor 18 may be developed based on captured media stored on device 14 
and further transferred to device 14. Thus, the post-processor 18 may be 
employed to enhance functionality of device 14 by periodically improving 
recognition of annotations stored on the device and updating an index on the 
device accordingly. It is envisioned that these services may be provided by a 
manufacturer, distributor, and/or retailer of device 14 and that subscription fees 
may be involved. Also, storage services for captured media may be additionally 
provided. 
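
The tag conversion performed by the mapping module may be sketched as a simple lookup, assuming that a macro can be represented as a table from original tags to alternative tags. The function `apply_macro` and the sample roster in the usage note are hypothetical.

```python
def apply_macro(tags, macro):
    """Convert each tag that the macro covers; leave all other tags unchanged."""
    return [macro.get(tag, tag) for tag in tags]
```

For example, with an assumed roster macro `{"number 12": "J. Smith"}`, the tags `["number 12", "touchdown"]` would become `["J. Smith", "touchdown"]`.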

[0029] Any of the mapping, semantics, and/or clustering may be 
customized by a user as desired, and this customization ability is extended 
toward focused lexica as well. For example, the user may download initial 
focused lexica 12 from source 10 and edit the lexica with editor 98, employing 
greater speech recognition capability to facilitate the editing process compared to 
an editing process using spelled word input on device 14. These customized 
lexica can be stored in datastore 100 for transfer to any suitably equipped device 
14 that the user selects. As a result, the user may still obtain the benefits of 
previous customization when purchasing additional devices 14 and/or a new 
model of device 14. Also, focused lexica that are edited on device 14 can be 
transferred to datastore 100 and/or to another device 14. 

[0030] The method according to the present invention is illustrated in 
Figure 5, and includes storing focused lexica on the media capture device at step 
102. It is envisioned that the lexica may be edited prior to transfer to the device. 
The method further includes step 104 of receiving user input specifying a mode 
of operation of the device. It is envisioned that the mode may be specified by 
manipulation of a switching mechanism of a manual user interface, and/or by 
speech input and keyword recognition. Step 104 also includes specification of a 
user identity by a switching mechanism, speech recognition, and/or voice print 
recognition. Step 106 includes receiving a user speech input via an audio input 
of the device, and this speech input is designated for operating the device during 
the previously specified mode of operation. Accordingly, the speech input is 
recognized in step 108 based on a currently specified lexicon, which may be 
related to a specific media capture activity. Preferably, a lexicon containing 
names of folders corresponding to names of media capture activities remains 
open to supplement the current lexicon. Thus, if the recognized speech input 
corresponds as at 110 to one of the media capture activities, then the folder for 
that activity is activated at step 112. Activation of this folder causes the lexicon 
of that folder to be designated as the current lexicon, and processing returns to 
step 108. It is envisioned that modes of device operation may similarly be 
recognized by speech, and/or that folders may be selected by manual input from 
a user interface. 

[0031] Speech input that does not designate a new mode or activity 
category is used to operate the device according to the designated mode. For 
example, if the device is in tag mode, then any text generated during the 
recognition attempt on the input speech using the folder lexicon at step 108 is 
used to tag the captured media at step 116. The speech sample is used to 
annotate the captured media at step 118, and the captured media, tag, and 
annotation are stored in association with one another in device memory at step 
120. Also, if the device is in lexicon edit mode, then a current lexicon uses 
letter-to-sound rules to generate a text from input speech for a selected annotation, 
and the text is added to the current lexicon in association with a speech model of 
the annotation at step 122. Further, if the device is in retrieval mode, then an 
attempt is made to match the input speech to either tags or annotations of 
captured media and to retrieve the matching captured media for playback at step 
124. Additional steps may follow for interacting with an external post-processor. 
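
The flow of Figure 5 for a single utterance may be sketched as follows. The state dictionary, its keys, the mode names, and the function `handle_utterance` are hypothetical. Recognition itself is assumed to have already produced `text`, and in lexicon edit mode the sketch simply adds that text to the current lexicon without modeling the letter-to-sound rules or the speech model.

```python
def handle_utterance(state, text, sample):
    """Process one recognized utterance (roughly steps 108 to 124 of Figure 5)."""
    # Steps 110-112: a folder name activates that folder and its lexicon.
    if text in state["folders"]:
        state["folder"] = text
        state["lexicon"] = state["folders"][text]
        return "folder selected"

    mode = state["mode"]
    if mode == "tag":
        # Steps 116-120: store media, tag, and annotation in association.
        state["store"].append(
            {"media": state["media"], "tag": text, "annotation": sample}
        )
        return "tagged"
    if mode == "edit":
        # Step 122 (simplified): add the generated text to the current lexicon.
        state["lexicon"].add(text)
        return "lexicon updated"
    if mode == "retrieve":
        # Step 124: return captured media whose tag matches the query.
        return [r["media"] for r in state["store"] if r["tag"] == text]
    raise ValueError(f"unknown mode: {mode}")
```

Checking for folder names before consulting the mode reflects the folder lexicon remaining open to supplement the current lexicon.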

[0032] It should be readily understood that the present invention may 
be employed in a variety of embodiments, and is not limited to initial capture of 
media, even though the invention is developed in part to deal with limited speech 
recognition capabilities of portable media capture devices. For example, the 
invention may be employed in a portable MP3 player that substantially 
instantaneously records previously recorded music received in digital form. In 
such an embodiment, an application of the present invention may be similar to 
that employed with digital still cameras, such that user speech is received over 
an audio input and employed to tag and annotate the compressed music file. 
Alternatively or additionally, the present invention may accept user speech during 
playback of downloaded music and tag and/or annotate temporally 
corresponding locations in the compressed music files. As a result, limited 
speech recognition capabilities of the MP3 player are enhanced by use of 
focused lexica related to download and/or playback of compressed music files. 
Thus, download and/or playback of previously captured media may be 
interpreted as a recapture of the media, especially where tags and/or annotations 
are added to recaptured media. 

[0033] It should also be readily understood that the present invention 
may be employed in alternative and/or additional ways, and is not limited to 
portable media capture devices. For example, the invention may be employed in 
a non-portable photography or music studio to tag and annotate captured media 
based on focused lexica, even though relatively unlimited speech recognition 
capability may be available. Further, the present invention may be employed in 
personal digital assistants, lap top computers, cell phones and/or equivalent 
portable devices that download executable code, download web pages, and/or 
receive media broadcasts. Still further, the present invention may be employed 
in non-portable counterparts to the aforementioned devices, such as desk top 
computers, televisions, video cassette recorders, and/or equivalent non-portable 
devices. Moreover, the description of the invention is merely exemplary in nature 
and, thus, variations that do not depart from the gist of the invention are intended 
to be within the scope of the invention. Such variations are not to be regarded as 
a departure from the spirit and scope of the invention. 